Code
merged_data <- readRDS("data/merged_2013_2014.rds")
merged_n <- nrow(merged_data)A Capstone Project on Bayesian Applications in Epidemiologic Modeling
Namita Mishra
Autumn Wilcox
November 2, 2025
Slides: slides.html (Edit slides.qmd.)
Diabetes mellitus (DM) remains a major public health challenge, and identifying key risk factors—such as obesity, age, sex, and race/ethnicity—is essential for prevention and targeted intervention. Logistic regression is widely used to estimate associations between such factors and binary outcomes like diabetes diagnosis. However, classical maximum likelihood estimation (MLE) can produce unstable estimates in the presence of missing data, quasi-separation, or small samples. Bayesian logistic regression offers a robust alternative by integrating prior information, regularizing estimates, and quantifying uncertainty more transparently than frequentist approaches.
Bayesian hierarchical models, implemented via Markov Chain Monte Carlo (MCMC), have been successfully applied in predicting patient health status across diseases such as pneumonia, prostate cancer, and mental disorders (Zeger et al. 2020). By representing predictive uncertainty alongside point estimates, Bayesian inference offers a practical advantage in epidemiologic modeling where decisions hinge on probabilistic thresholds. Beyond stability, Bayesian methods support model checking, variable selection, and uncertainty quantification under missingness or imputation frameworks (Baldwin and Larson 2017; Kruschke and Liddell 2017).
Recent work has expanded Bayesian applications to disease diagnostics and health risk modeling. For instance, Bayesian approaches have been used to evaluate NHANES diagnostic data (Chatzimichail and Hatjimihail 2023), to model cardiovascular and metabolic risk (Liu et al. 2013), and to integrate multiple data modalities such as imaging and laboratory measures (Abdullah, Hassan, and Mustafa 2022). Moreover, multiple imputation combined with Bayesian modeling generates robust estimates when data are missing at random (MAR) or not at random (MNAR) (Austin et al. 2021).
The broader Bayesian literature emphasizes the role of priors and model checking. Weakly informative priors, such as \(N(0, 2.5)\) for coefficients, regularize estimation and reduce variance in small samples (Gelman et al. 2008; Vande Schoot et al. 2021). Tutorials using R packages like brms and blavaan illustrate how MCMC enables posterior inference and empirical Bayes analysis (Klauenberg et al. 2015).
Beyond standard generalized linear models, Bayesian nonparametric regression flexibly captures nonlinearity and zero inflation common in health data (Richardson and Hartman 2018). Bayesian Additive Regression Trees (BART) improve variable selection in mixed-type data (Luo et al. 2024), while state-space and dynamic Bayesian models incorporate time-varying biomarkers for longitudinal prediction (Momeni et al. 2021). Bayesian model averaging (BMA) further addresses model uncertainty by weighting across multiple specifications (Hoeting et al. 1999). Together, these approaches demonstrate the versatility and growing importance of Bayesian inference in clinical and epidemiologic modeling.
The objective of this project is to evaluate whether Bayesian inference provides more stable and interpretable estimates of diabetes risk than frequentist and imputation-based approaches, particularly when data complexity or separation challenges arise. Agreement across modeling frameworks supports the robustness of these associations and highlights the interpretability and uncertainty quantification advantages offered by Bayesian analysis in population health modeling (National Center for Health Statistics (NCHS) 2014).
The present study employs Bayesian logistic regression to predict diabetes status and examine the relationships between diabetes and key predictors, including body mass index (BMI), age (≥20 years), sex, and race. Using retrospective data from the 2013–2014 NHANES survey, the analysis accounts for the study’s complex sampling design, which involves stratification, clustering, and the oversampling of specific subpopulations rather than simple random sampling. The Bayesian framework is applied to address common analytical challenges such as missing data, complete case bias, and data separation, thereby improving the robustness and reliability of inference compared to traditional logistic regression methods.
The Bayesian framework integrates prior knowledge with observed data to generate posterior distributions, allowing parameters to be interpreted directly in probabilistic terms.
Unlike traditional frequentist approaches that yield single-point estimates and p-values, Bayesian methods represent parameters as random variables with full probability distributions.
This provides greater flexibility, incorporates parameter uncertainty, and produces credible intervals that directly quantify the probability that a parameter lies within a given range.
Bayesian logistic regression models the log-odds of a binary outcome as a linear combination of predictors:
\[ \text{logit}(P(Y = 1)) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k \]
where
In the Bayesian framework, model parameters (\(\boldsymbol{\beta}\)) are treated as random variables and assigned prior distributions that reflect existing knowledge or plausible ranges before observing the data. After incorporating the observed evidence, the priors are updated through Bayes’ theorem (Leeuw and Klugkist 2012; Klauenberg et al. 2015):
\[ \text{Posterior} \propto \text{Likelihood} \times \text{Prior} \]
This formulation allows uncertainty to propagate naturally through the model, producing posterior distributions for each coefficient that can be directly interpreted as probabilities.
Weakly informative priors were used to regularize estimation without imposing strong assumptions:
Such priors help stabilize estimation in the presence of multicollinearity, limited sample size, or potential outliers.
Posterior distributions of regression coefficients were used to estimate the probability of the outcome for given predictor values. This allows statements such as:
Posterior predictions account for two key sources of uncertainty:
In Bayesian analysis, all unknown quantities—coefficients, means, variances, or probabilities—are treated as random variables described by their posterior distributions.
Model quality and convergence were assessed using standard Bayesian diagnostics:
This study used publicly available 2013–2014 NHANES data published by the CDC’s National Center for Health Statistics (National Center for Health Statistics (NCHS) 2014). Three component files were utilized: DEMO_H (demographics), BMX_H (body measures), and DIQ_H (diabetes questionnaire). Each file was imported in .XPT format using the haven package in R, and merged using the unique participant identifier SEQN to create a single adult analytic dataset (age ≥ 20 years).
All variables were coerced to consistent numeric or factor types prior to merging to ensure atomic columns suitable for survey-weighted analysis and modeling. The use of SEQN preserved respondent integrity across datasets and ensured accurate record linkage. This preprocessing step standardized variable formats and minimized inconsistencies between files.
Data wrangling, cleaning, and merging were performed in R using a combination of base functions and tidyverse packages. Bayesian logistic regression modeling was later implemented using the brms interface to Stan, allowing probabilistic inference within a reproducible workflow that accommodated the NHANES complex survey design and missing data considerations.
The merged dataset contains 10,175 participants. It integrates demographic, examination, and diabetes questionnaire data. We then restrict the sample to adults (age ≥ 20) to define the analytic cohort used in subsequent analyses. A small proportion of records have missing values in BMI and diabetes status, which will be addressed later through multiple imputation.
| RIDAGEYR | BMXBMI | RIAGENDR | RIDRETH1 | DIQ010 |
|---|---|---|---|---|
| 69 | 26.7 | 1 | 4 | 1 |
| 54 | 28.6 | 1 | 3 | 1 |
| 72 | 28.9 | 1 | 3 | 1 |
| 9 | 17.1 | 1 | 3 | 2 |
| 73 | 19.7 | 2 | 3 | 2 |
| 56 | 41.7 | 1 | 1 | 2 |
| 0 | NA | 1 | 3 | NA |
| 61 | 35.7 | 2 | 3 | 2 |
| 42 | NA | 1 | 2 | 2 |
| 56 | 26.5 | 2 | 3 | 2 |
Response Variable:
diabetes_ind (binary) represents a Type 2 diabetes diagnosis, excluding gestational diabetes. It was derived from DIQ010 (“Doctor told you have diabetes”), while DIQ050 (insulin use) was excluded to prevent treatment-related confounding.
Predictor Variables:
BMXBMI – Body Mass Index (categorical; six levels as bmi_cat)RIDAGEYR – Age (continuous, 20–80 years)RIAGENDR – Sex (factor, two levels)RIDRETH1 – Ethnicity (factor, five levels)library(knitr)
library(kableExtra)
library(dplyr)
library(tibble)
# -----------------------------
# Variable descriptions (with `code` formatting for names)
# -----------------------------
var_tbl <- tribble(
~Variable, ~Description, ~Type, ~Origin,
"`diabetes_ind`","Type 2 diabetes diagnosis (1 = Yes, 0 = No) derived from `DIQ010`; gestational diabetes excluded.", "Categorical", "Derived from `DIQ010`",
"`age`", "Age in years.", "Continuous", "NHANES `RIDAGEYR`",
"`bmi`", "Body Mass Index (kg/m^2) computed from measured height and weight.", "Continuous", "NHANES `BMXBMI`",
"`bmi_cat`", "BMI categories: Underweight, Normal, Overweight, Obesity I–III (`Normal` is reference in models).", "Categorical", "Derived from `bmi`",
"`sex`", "Sex of participant (`Male`, `Female`).", "Categorical", "NHANES `RIAGENDR`",
"`race`", "Race/Ethnicity with five levels (e.g., Non-Hispanic White, Non-Hispanic Black, Mexican American).", "Categorical", "NHANES `RIDRETH1`",
"`WTMEC2YR`", "Examination sample weight for Mobile Examination Center participants.", "Weight", "NHANES design",
"`SDMVPSU`", "Primary Sampling Unit used for variance estimation in the complex survey design.", "Design", "NHANES design",
"`SDMVSTRA`", "Stratum identifier used to define strata for the complex survey design.", "Design", "NHANES design",
"`age_c`", "Centered and standardized age (z-score).", "Continuous", "Derived from `age`",
"`bmi_c`", "Centered and standardized BMI (z-score).", "Continuous", "Derived from `bmi`"
)
kbl(
var_tbl,
caption = "Variable Descriptions: Adult Analytic Dataset",
align = c("l","l","l","l"),
escape = FALSE
) %>%
kable_styling(full_width = FALSE, position = "center", bootstrap_options = c("striped","hover")) %>%
group_rows("Analysis variables", 1, 6) %>% # <-- updated range (now 6 analysis rows)
group_rows("Survey design variables", 7, 9) %>%
group_rows("Derived variables", 10, 11)| Variable | Description | Type | Origin |
|---|---|---|---|
| Analysis variables | |||
| `diabetes_ind` | Type 2 diabetes diagnosis (1 = Yes, 0 = No) derived from `DIQ010`; gestational diabetes excluded. | Categorical | Derived from `DIQ010` |
| `age` | Age in years. | Continuous | NHANES `RIDAGEYR` |
| `bmi` | Body Mass Index (kg/m^2) computed from measured height and weight. | Continuous | NHANES `BMXBMI` |
| `bmi_cat` | BMI categories: Underweight, Normal, Overweight, Obesity I–III (`Normal` is reference in models). | Categorical | Derived from `bmi` |
| `sex` | Sex of participant (`Male`, `Female`). | Categorical | NHANES `RIAGENDR` |
| `race` | Race/Ethnicity with five levels (e.g., Non-Hispanic White, Non-Hispanic Black, Mexican American). | Categorical | NHANES `RIDRETH1` |
| Survey design variables | |||
| `WTMEC2YR` | Examination sample weight for Mobile Examination Center participants. | Weight | NHANES design |
| `SDMVPSU` | Primary Sampling Unit used for variance estimation in the complex survey design. | Design | NHANES design |
| `SDMVSTRA` | Stratum identifier used to define strata for the complex survey design. | Design | NHANES design |
| Derived variables | |||
| `age_c` | Centered and standardized age (z-score). | Continuous | Derived from `age` |
| `bmi_c` | Centered and standardized BMI (z-score). | Continuous | Derived from `bmi` |
The National Health and Nutrition Examination Survey (NHANES) employs a complex, multistage probability sampling design with stratification, clustering, and oversampling of specific demographic groups (for example, racial/ethnic minorities and older adults) to produce nationally representative estimates of the U.S. population.
Survey design variables — primary sampling units (SDMVPSU), strata (SDMVSTRA), and examination sample weights (WTMEC2YR) — were retained to account for this complex design. These variables were applied to adjust for unequal probabilities of selection, nonresponse, and oversampling, ensuring valid standard errors, unbiased prevalence estimates, and generalizable population-level inference.
A survey-weighted logistic regression model was used to evaluate the association between diabetes status (diabetes_ind, binary outcome) and key predictors: body mass index (bmi), age (age), sex (sex), and race/ethnicity (race). Diabetes was defined using DIQ010 (“Doctor told you have diabetes”) and coded as 0/1, with DIQ050 (insulin use) excluded to avoid treatment-related confounding.
Covariates included:
- age (continuous; centered as age_c, categorized 20–80 years)
- bmi (continuous; centered as bmi_c, and categorized by BMI class bmi_cat)
- sex (male, female)
- race (five ethnicity levels)
This approach accounts for NHANES’ complex sampling design, producing unbiased parameter estimates and valid inference for U.S. adults.
| Step | Description |
|---|---|
| Weighting | Used the survey package to calculate weighted means and standard deviations for all variables. |
| Standardization | Centered and standardized BMI and age (bmi_c, age_c) for use in regression models. |
| Age Categorization | Recoded into intervals: 20–<30, 30–<40, 40–<50, 50–<60, 60–<70, and 70–80 years. |
| BMI Categorization | Recoded as: <18.5 (Underweight), 18.5–<25 (Normal), 25–<30 (Overweight), 30–<35 (Obesity I), 35–<40 (Obesity II), ≥40 (Obesity III). |
| Ethnicity Recoding | Recoded as: 1 = Mexican American, 2 = Other Hispanic, 3 = Non-Hispanic White, 4 = Non-Hispanic Black, 5 = Other/Multi. |
| Special Codes | Transformed nonresponse codes (e.g., 3, 7) to NA. These missing codes were evaluated for potential nonrandom patterns (MAR/MNAR). |
| Missing Data | Retained and visualized missing values to assess their pattern and informativeness before multiple imputation. |
| Final Dataset | Created the cleaned analytic dataset (adult) using Non-Hispanic White and Male as reference groups for modeling. |
mean SE
age 47.496 0.3805
mean SE
diabetes_ind 0.10391 0.0052
# Design effect and effective sample size for `diabetes_ind`
v_hat <- as.numeric(survey::svyvar(~diabetes_ind, nhanes_design_adult, na.rm = TRUE))
p_hat <- mean(adult$diabetes_ind, na.rm = TRUE)
n_obs <- nrow(adult)
v_srs <- p_hat * (1 - p_hat) / n_obs
deff <- v_hat / v_srs
n_total <- sum(weights(nhanes_design_adult), na.rm = TRUE)
ess <- as.numeric(n_total / deff)
cat("Design effect for diabetes_ind:", round(deff, 2), "\n")Design effect for diabetes_ind: 4777.96
Effective sample size for diabetes_ind: 47960
Descriptive statistics for continuous and categorical variables are presented below.
# Keep only analytic variables for Table 1
tbl1_dat <- adult %>%
transmute(
age,
bmi,
bmi_cat,
sex,
race4,
# CHANGE: make diabetes_ind a factor so it plays nice with pivot_longer()
diabetes_ind = factor(diabetes_ind, levels = c(0, 1), labels = c("No", "Yes"))
)
# Continuous summaries: n, missing, mean, sd, min, max
cont_vars <- c("age", "bmi")
cont_sum <- tbl1_dat %>%
select(all_of(cont_vars)) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "value") %>%
group_by(Variable) %>%
summarise(
N = sum(!is.na(value)),
Missing = sum(is.na(value)),
Mean = mean(value, na.rm = TRUE),
SD = sd(value, na.rm = TRUE),
Min = min(value, na.rm = TRUE),
Max = max(value, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(across(c(Mean, SD, Min, Max), ~round(.x, 2)))
# Categorical summaries: counts and percents
cat_vars <- c("sex", "race4", "diabetes_ind", "bmi_cat")
cat_sum <- tbl1_dat %>%
# CHANGE: ensure all cat vars are factors and include NA as an explicit "(Missing)" level
mutate(across(all_of(cat_vars),
~ forcats::fct_explicit_na(as.factor(.x), na_level = "(Missing)"))) %>%
select(all_of(cat_vars)) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Level") %>%
count(Variable, Level, name = "n") %>%
group_by(Variable) %>%
mutate(pct = round(100 * n / sum(n), 1)) %>%
ungroup() %>%
arrange(Variable, desc(n))
kable(cont_sum, caption = "Table 1a. Continuous variables (age, BMI): N, missing, mean (SD), range.")| Variable | N | Missing | Mean | SD | Min | Max |
|---|---|---|---|---|---|---|
| age | 5769 | 0 | 49.11 | 17.56 | 20.0 | 80.0 |
| bmi | 5520 | 249 | 29.10 | 7.15 | 14.1 | 82.9 |
| Variable | Level | n | pct |
|---|---|---|---|
| bmi_cat | Overweight | 1768 | 30.6 |
| bmi_cat | Normal | 1579 | 27.4 |
| bmi_cat | Obesity I | 1145 | 19.8 |
| bmi_cat | Obesity II | 519 | 9.0 |
| bmi_cat | Obesity III | 419 | 7.3 |
| bmi_cat | (Missing) | 249 | 4.3 |
| bmi_cat | Underweight | 90 | 1.6 |
| diabetes_ind | No | 4870 | 84.4 |
| diabetes_ind | Yes | 722 | 12.5 |
| diabetes_ind | (Missing) | 177 | 3.1 |
| race4 | White | 2472 | 42.8 |
| race4 | Hispanic | 1275 | 22.1 |
| race4 | Black | 1177 | 20.4 |
| race4 | Other | 845 | 14.6 |
| sex | Female | 3011 | 52.2 |
| sex | Male | 2758 | 47.8 |
Table 1a and 1b summarize the analytic variables included in subsequent models. Mean age and BMI values indicate an adult cohort spanning a wide range of body composition, while categorical summaries confirm balanced sex representation and sufficient sample sizes across race/ethnicity categories. These variables were standardized and used as predictors in all modeling frameworks.
| SDMVPSU | SDMVSTRA | WTMEC2YR | diabetes_ind | bmi | age | sex | race | age_c | bmi_c | bmi_cat | race4 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 112 | 13481.04 | 1 | 26.7 | 69 | Male | NH Black | 1.1324183 | -0.3358861 | Overweight | Black |
| 1 | 108 | 24471.77 | 1 | 28.6 | 54 | Male | NH White | 0.2783598 | -0.0702810 | Overweight | White |
| 1 | 109 | 57193.29 | 1 | 28.9 | 72 | Male | NH White | 1.3032300 | -0.0283434 | Overweight | White |
| 2 | 116 | 65541.87 | 0 | 19.7 | 73 | Female | NH White | 1.3601672 | -1.3144311 | Normal | White |
| 1 | 111 | 25344.99 | 0 | 41.7 | 56 | Male | Mexican American | 0.3922343 | 1.7609961 | Obesity III | Hispanic |
| 1 | 114 | 61758.65 | 0 | 35.7 | 61 | Female | NH White | 0.6769204 | 0.9222432 | Obesity II | White |
As shown in Table 1, the analytic adult cohort (N = 5,769) includes standardized variables for age and BMI (age_c, bmi_c), categorical indicators for sex and race/ethnicity (race4), and a binary doctor-diagnosed diabetes variable (diabetes_ind).
'data.frame': 5769 obs. of 12 variables:
$ SDMVPSU : num 1 1 1 2 1 1 2 1 2 2 ...
$ SDMVSTRA : num 112 108 109 116 111 114 106 112 112 113 ...
$ WTMEC2YR : num 13481 24472 57193 65542 25345 ...
$ diabetes_ind: num 1 1 1 0 0 0 0 0 0 0 ...
$ bmi : num 26.7 28.6 28.9 19.7 41.7 35.7 NA 26.5 22 20.3 ...
$ age : num 69 54 72 73 56 61 42 56 65 26 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 1 2 1 2 1 ...
$ race : Factor w/ 5 levels "Mexican American",..: 2 3 3 3 1 3 4 3 3 3 ...
$ age_c : num 1.132 0.278 1.303 1.36 0.392 ...
$ bmi_c : num -0.3359 -0.0703 -0.0283 -1.3144 1.761 ...
$ bmi_cat : Factor w/ 7 levels "Underweight",..: 3 3 3 2 6 5 7 3 2 2 ...
$ race4 : Factor w/ 4 levels "White","Hispanic",..: 3 1 1 1 2 1 2 1 1 1 ...
miss_tbl <- tibble::tibble(
Variable = c("`bmi`","`diabetes_ind`"),
Missing_n = c(sum(is.na(adult_eda$bmi)),
sum(is.na(adult_eda$diabetes_ind))),
Missing_pct = round(c(mean(is.na(adult_eda$bmi)),
mean(is.na(adult_eda$diabetes_ind))) * 100, 1)
)
knitr::kable(
miss_tbl,
caption = "Missingness for key analysis variables."
)| Variable | Missing_n | Missing_pct |
|---|---|---|
bmi |
249 | 4.3 |
diabetes_ind |
177 | 3.1 |
Missingness was low overall (~7.3%). Gaps were concentrated in bmi and diabetes_ind; design variables and demographics were complete. Patterns were consistent with MAR, so we used multiple imputation (MICE) prior to modeling.
SDMVPSU SDMVSTRA WTMEC2YR diabetes_ind bmi age
0 0 0 177 249 0
sex race age_c bmi_c bmi_cat race4
0 0 0 249 0 0
diabetes_ind bmi bmi_c age
177 249 249 0
1 2 3 7 9 <NA>
737 8841 185 1 5 406
[1] 1120
Results indicated that missing values were limited primarily to BMI and diabetes indicators, while demographic and survey design variables were complete.
The overall missingness (~4%) was low and appears consistent with data missing at random (MAR). This pattern likely reflects participation differences in the physical examination component among older adults or individuals with health limitations.
Given these findings, multiple imputation using chained equations (MICE) was employed to minimize bias and retain statistical power in subsequent modeling.
Following the missing data assessment, exploratory analyses were conducted to describe the adult analytic cohort and visualize distributions across key demographic and health variables. The goal was to examine univariate patterns and bivariate relationships relevant to diabetes prevalence before modeling.
The adult analytic cohort was broadly representative of the U.S. population, with a majority identifying as Non-Hispanic White. Age and BMI distributions were right-skewed, with most participants classified as overweight or obese. Visual exploration revealed a clear positive relationship between age, BMI, and diabetes prevalence. Non-Hispanic Black and Hispanic participants exhibited higher proportions of diabetes compared to Non-Hispanic Whites.
Approximately 25% of variables were categorical (for example, sex, race4, diabetes_ind) and 75% were continuous (age, bmi, age_c, bmi_c), indicating that the dataset primarily consisted of measured numeric values such as BMI and age. About 93% of rows contained complete information across all predictors and outcomes, reflecting high data quality.
Age was relatively evenly distributed across adult age groups, while BMI was concentrated in the overweight and obese ranges. Female participants were slightly overrepresented relative to Male participants.
# BMI by diabetes outcome (boxplot)
# (You can’t use boxplot with categorical y, so revert to numeric BMI here)
ggplot(adult, aes(x = factor(diabetes_ind, levels = c(0,1), labels = c("No","Yes")), y = bmi)) +
geom_boxplot(fill = "lightblue") +
labs(title = "BMI by Diabetes Diagnosis (≥20 years)", x = "Diabetes (No/Yes)", y = "BMI (numeric)") +
theme_minimal()# Diabetes by race4 (dodged bars)
ggplot(adult, aes(x = race4, fill = factor(diabetes_ind, levels = c(0,1), labels = c("No","Yes")))) +
geom_bar(position = "dodge") +
labs(title = "Diabetes Diagnosis by Race/Ethnicity (≥20 years)",
x = "Race/Ethnicity (race4)", y = "Count", fill = "Diabetes") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))The EDA missingness summary shows approximately 4.3% missing BMI and 3.1% missing diabetes status (diabetes_ind). All design variables (WTMEC2YR, SDMVPSU, SDMVSTRA), as well as age, sex, and race4, are complete—sex and race NAs are encoded as explicit “(Missing)” levels in the EDA view.
Three modeling frameworks were compared using identical predictors (standardized age, BMI, sex, and race4) and the binary outcome diabetes_ind: (1) survey-weighted logistic regression to incorporate the NHANES complex sampling design, (2) multiple imputation (MICE) to address missing BMI values, and (3) Bayesian logistic regression with weakly informative priors to quantify uncertainty.
adult_clean <- adult %>%
dplyr::mutate(
sex = forcats::fct_drop(sex),
race4 = forcats::fct_drop(race4),
age_c = as.numeric(age_c),
bmi_c = as.numeric(bmi_c)
) %>%
dplyr::filter(!is.na(diabetes_ind), !is.na(age_c), !is.na(bmi_c)) %>%
droplevels()
str(adult_clean[, c("diabetes_ind","sex","race4","age_c","bmi_c")])'data.frame': 5349 obs. of 5 variables:
$ diabetes_ind: num 1 1 1 0 0 0 0 0 0 1 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 1 1 2 1 2 ...
$ race4 : Factor w/ 4 levels "White","Hispanic",..: 3 1 1 1 2 1 1 1 1 1 ...
$ age_c : num 1.132 0.278 1.303 1.36 0.392 ...
$ bmi_c : num -0.3359 -0.0703 -0.0283 -1.3144 1.761 ...
options(survey.lonely.psu = "adjust")
nhanes_design_adult <- survey::svydesign(
id = ~SDMVPSU,
strata = ~SDMVSTRA,
weights = ~WTMEC2YR,
nest = TRUE,
data = adult_clean
)
des_cc <- subset(nhanes_design_adult, !is.na(diabetes_ind) &
!is.na(age_c) & !is.na(bmi_c) &
!is.na(sex) & !is.na(race4))
# Sanity checks before modeling
stopifnot(nlevels(adult_clean$sex) >= 2)
stopifnot(nlevels(adult_clean$race4) >= 2)
table(adult_clean$sex)
Female Male
2798 2551
White Hispanic Black Other
2293 1183 1101 772
svy_fit <- survey::svyglm(
diabetes_ind ~ age_c + bmi_c + sex + race4,
design = des_cc,
family = quasibinomial()
)
svy_or <- broom::tidy(svy_fit, conf.int = TRUE) %>%
dplyr::mutate(
OR = exp(estimate),
LCL = exp(conf.low),
UCL = exp(conf.high)
) %>%
dplyr::select(term, OR, LCL, UCL, p.value) %>%
dplyr::filter(term != "(Intercept)")Design-based odds ratios are summarized in Table 2.
| term | OR | LCL | UCL | p.value |
|---|---|---|---|---|
| age_c | 2.977668 | 2.704677 | 3.278212 | 0.0000000 |
| bmi_c | 1.930284 | 1.679190 | 2.218924 | 0.0000021 |
| sexMale | 1.287236 | 1.018664 | 1.626618 | 0.0373081 |
| race4Hispanic | 1.809943 | 1.428957 | 2.292507 | 0.0003024 |
| race4Black | 1.599844 | 1.157393 | 2.211434 | 0.0094754 |
| race4Other | 2.195356 | 1.461219 | 3.298334 | 0.0017976 |
The NHANES 2013–2014 data use a complex, multistage probability design involving strata (SDMVSTRA), primary sampling units (PSUs; SDMVPSU), and examination weights (WTMEC2YR) to ensure nationally representative estimates (National Center for Health Statistics (NCHS) 2014).
Estimates are population-weighted using NHANES survey design variables (WTMEC2YR, SDMVSTRA, SDMVPSU). Odds ratios are reported per one standard deviation (1 SD) increase in age and BMI, with reference groups Male and White.
Multiple Imputation by Chained Equations (MICE) was used as a principled approach for handling missing data (Stef van Buuren and Groothuis-Oudshoorn 2011; S. van Buuren 2012).
MICE iteratively imputes each incomplete variable using regression models based on other variables in the dataset, producing multiple completed datasets that reflect uncertainty due to missingness. Estimates are then pooled across imputations using Rubin’s rules to generate final parameter estimates and confidence intervals.
MICE, as an alternative to the Bayesian approach, effectively manages missing data through chained regression equations without requiring full joint modeling of all variables.
For large sample sizes (n ≥ 400), even in the presence of high percentages (up to 75%) of missing data in one variable, non-normal distributions such as flat densities, heavy tails, skewness, and multimodality do not materially affect mean structure estimation performance (S. van Buuren 2012).
In this study, continuous variables (age and BMI) were imputed using predictive mean matching (PMM) to preserve realistic distributions, while categorical variables (sex and race4) were imputed using logistic and polytomous regression models, respectively. Diabetes status (diabetes_ind) was treated as an outcome variable and was not imputed. Twenty imputations were generated to reduce Monte Carlo error and maintain robust variance estimation.
adult_imp1 <- mice::complete(imp, 1) %>%
dplyr::mutate(
sex = factor(sex, levels = levels(adult$sex)),
race4 = factor(race4, levels = levels(adult$race4)),
age_c = as.numeric(scale(age)),
bmi_c = as.numeric(scale(bmi)),
wt_norm = WTMEC2YR / mean(WTMEC2YR, na.rm = TRUE)
) %>%
dplyr::filter(!is.na(diabetes_ind), !is.na(age_c), !is.na(bmi_c)) %>%
droplevels()invisible(capture.output({
bayes_fit <- brms::brm(
formula = fml_bayes,
data = adult_imp1,
family = bernoulli(link = "logit"),
prior = priors,
chains = 4, iter = 2000, seed = 123,
control = list(adapt_delta = 0.95),
refresh = 0
)
}))
bayes_or <- brms::posterior_summary(bayes_fit, pars = "^b_") %>%
as.data.frame() %>%
tibble::rownames_to_column("raw") %>%
dplyr::mutate(
term = gsub("^b_", "", raw),
OR = exp(Estimate),
LCL = exp(Q2.5),
UCL = exp(Q97.5)
) %>%
dplyr::select(term, OR, LCL, UCL)| term | OR | LCL | UCL |
|---|---|---|---|
| Intercept | 0.06 | 0.05 | 0.07 |
| age_c | 2.93 | 2.62 | 3.30 |
| bmi_c | 1.92 | 1.76 | 2.10 |
| sexMale | 1.28 | 1.06 | 1.55 |
| race4Hispanic | 1.82 | 1.40 | 2.38 |
| race4Black | 1.62 | 1.23 | 2.12 |
| race4Other | 2.11 | 1.46 | 2.99 |
Multiple imputation preserves sample size and reduces bias from missing BMI values. Results closely mirror the survey-weighted model, confirming robustness to imputation.
miss_age <- sum(is.na(mi_dat$age))
miss_bmiN <- sum(is.na(mi_dat$bmi))
mi_caption <- if (miss_age > 0 && miss_bmiN > 0) {
"Multiple Imputation (MICE): pooled odds ratios (OR) and 95% confidence intervals after imputing missing age (normal) and BMI (PMM) (m = 20); diabetes status was not imputed."
} else if (miss_bmiN > 0) {
"Multiple Imputation (MICE): pooled odds ratios (OR) and 95% confidence intervals after imputing missing BMI (PMM) (m = 20); diabetes status was not imputed."
} else if (miss_age > 0) {
"Multiple Imputation (MICE): pooled odds ratios (OR) and 95% confidence intervals after imputing missing age (normal) (m = 20); diabetes status was not imputed."
} else {
"Multiple Imputation (MICE): pooled odds ratios (OR) and 95% confidence intervals (no variables required imputation); diabetes status was not imputed."
}
mi_caption <- paste0(mi_caption, " Odds ratios are per 1 SD for age and BMI.")
knitr::kable(mi_or, caption = mi_caption)| term | OR | std.error | statistic | df | p.value | LCL | UCL | conf.low | conf.high | |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | scale(age) | 2.8956646 | 0.0524433 | 20.273592 | 5499.574 | 0.0000000 | 2.6127544 | 3.2092084 | 2.6127544 | 3.2092084 |
| 3 | scale(bmi) | 1.8053391 | 0.0430294 | 13.728961 | 3877.226 | 0.0000000 | 1.6592839 | 1.9642506 | 1.6592839 | 1.9642506 |
| 4 | relevel(sex, “Male”)Female | 0.8056102 | 0.0872427 | -2.477631 | 5545.705 | 0.0132553 | 0.6789653 | 0.9558776 | 0.6789653 | 0.9558776 |
| 5 | relevel(race4, “White”)Hispanic | 2.0741944 | 0.1128115 | 6.467183 | 5562.036 | 0.0000000 | 1.6626591 | 2.5875915 | 1.6626591 | 2.5875915 |
| 6 | relevel(race4, “White”)Black | 1.7931172 | 0.1137153 | 5.135240 | 5508.172 | 0.0000003 | 1.4348045 | 2.2409110 | 1.4348045 | 2.2409110 |
| 7 | relevel(race4, “White”)Other | 2.0011166 | 0.1443011 | 4.807344 | 5464.535 | 0.0000016 | 1.5080503 | 2.6553940 | 1.5080503 | 2.6553940 |
Bayesian logistic regression was implemented using the following model specification:
Formula:
diabetes_ind | weights(wt_norm) ~ age_c + bmi_c + sex + race4
stopifnot(all(c("diabetes_ind","age","bmi") %in% names(adult_imp1)))
correlation_matrix <- cor(
adult_imp1[, c("diabetes_ind", "age", "bmi")],
use = "complete.obs",
method = "pearson"
)
correlation_melted <- melt(correlation_matrix, varnames = c("Var1","Var2"), value.name = "value")
ggplot(correlation_melted, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limits = c(-1, 1), name = "Correlation") +
theme_minimal() +
labs(title = "Correlation Heatmap (adult_imp1)", x = "Features", y = "Features")Running MCMC with 4 sequential chains...
Chain 1 finished in 10.9 seconds.
Chain 2 finished in 10.2 seconds.
Chain 3 finished in 11.1 seconds.
Chain 4 finished in 10.6 seconds.
All 4 chains finished successfully.
Mean chain execution time: 10.7 seconds.
Total execution time: 43.2 seconds.
Posterior odds ratios and credible intervals from the Bayesian logistic regression are shown in Table 3.
As shown in Table 3, the Bayesian logistic regression model estimated the log-odds of diabetes using standardized predictors. Weakly informative priors (\(N(0, 2.5)\) for slopes, Student-t(3, 0, 10) for the intercept) stabilized estimation and prevented overfitting. The model used normalized NHANES exam weights as importance weights to approximate design-based inference. Posterior means and 95% credible intervals provided full uncertainty quantification for each predictor.
Posterior summaries were further evaluated using the Bayesian \(R^2\), which estimates the proportion of outcome variance explained by model predictors.
Model-level performance is summarized in Table 5.
| Estimate | Est.Error | Q2.5 | Q97.5 | |
|---|---|---|---|---|
| R2 | 0.1380082 | 0.0118579 | 0.115269 | 0.1616909 |
# Get diagnostics as plain columns (use metric names as strings)
diag <- posterior::summarise_draws(
bayes_fit,
"rhat", "ess_bulk", "ess_tail"
)
# Keep only slope parameters (b_*) and build a clean table.
diag_b <- diag |>
dplyr::as_tibble() |>
dplyr::filter(grepl("^b_", .data$variable)) |>
dplyr::transmute(
Parameter = .data$variable,
Rhat = .data$rhat,
Bulk_ESS = .data$ess_bulk,
Tail_ESS = .data$ess_tail
)
knitr::kable(diag_b, digits = 1)| Parameter | Rhat | Bulk_ESS | Tail_ESS |
|---|---|---|---|
| b_Intercept | 1 | 2721.2 | 2817.8 |
| b_age_c | 1 | 2605.7 | 2704.7 |
| b_bmi_c | 1 | 3229.5 | 2783.0 |
| b_sexMale | 1 | 3613.7 | 3128.7 |
| b_race4Hispanic | 1 | 3805.4 | 3109.7 |
| b_race4Black | 1 | 3685.6 | 3133.0 |
| b_race4Other | 1 | 3513.0 | 2637.5 |
All model parameters achieved R̂ values approximately equal to 1.00 and bulk/tail effective sample sizes exceeding 2,000, confirming strong convergence and well-mixed chains. The Bayesian R² was approximately 0.13, indicating that age, BMI, sex, and race collectively explained about 13% of variability in diabetes risk at the population level.
Model comparison results using leave-one-out cross-validation are presented below.
# Reduced models (let brms compile if needed)
invisible(capture.output({
fit_no_race <- update(bayes_fit, formula = update(fml_bayes, . ~ . - race4))
fit_no_sex <- update(bayes_fit, formula = update(fml_bayes, . ~ . - sex))
}))
# Compute LOO for all three models
loo_base <- loo::loo(bayes_fit)
loo_no_race <- loo::loo(fit_no_race)
loo_no_sex <- loo::loo(fit_no_sex)
# Compare (pass objects positionally)
cmp <- loo::loo_compare(loo_base, loo_no_race, loo_no_sex)
# Make row names explicit and sort best-to-worst already by loo_compare
cmp_df <- as.data.frame(cmp)
cmp_df$Model <- rownames(cmp_df)
cmp_df <- cmp_df[, c("Model", setdiff(names(cmp_df), "Model"))]
knitr::kable(
cmp_df,
caption = "LOO comparison (higher elpd_loo is better; table sorted best to worst)."
)| Model | elpd_diff | se_diff | elpd_loo | se_elpd_loo | p_loo | se_p_loo | looic | se_looic | |
|---|---|---|---|---|---|---|---|---|---|
| bayes_fit | bayes_fit | 0.000000 | 0.000000 | -1574.573 | 56.98453 | 8.291612 | 0.5425672 | 3149.146 | 113.9691 |
| fit_no_sex | fit_no_sex | -2.014434 | 3.294323 | -1576.587 | 57.14599 | 6.459157 | 0.4351583 | 3153.175 | 114.2920 |
| fit_no_race | fit_no_race | -14.736906 | 6.291956 | -1589.310 | 54.54774 | 5.446023 | 0.3593072 | 3178.620 | 109.0955 |
Leave-one-out (LOO) cross-validation showed that models excluding race or sex had lower expected log predictive density (elpd), indicating a poorer fit. This supports the inclusion of both variables as meaningful contributors to predictive performance and overall model adequacy.
Figures below visualize posterior distributions, MCMC diagnostics, and model fit.
adult_means <- adult_imp1 %>% summarise(
age_mean = mean(age, na.rm = TRUE),
age_sd = sd(age, na.rm = TRUE),
bmi_mean = mean(bmi, na.rm = TRUE),
bmi_sd = sd(bmi, na.rm = TRUE)
)
to_model_row <- function(age_raw, bmi_raw, sex_lab, race4_lab) {
tibble(
age_c = (age_raw - adult_means$age_mean)/adult_means$age_sd,
bmi_c = (bmi_raw - adult_means$bmi_mean)/adult_means$bmi_sd,
sex = factor(sex_lab, levels = levels(adult_imp1$sex)),
race4 = factor(race4_lab, levels = levels(adult_imp1$race4)),
wt_norm = 1
)
}
plot_post_density <- function(df_row, title_txt) {
phat <- posterior_linpred(bayes_fit, newdata = df_row, transform = TRUE)
ci95 <- quantile(phat, c(0.025, 0.975))
ggplot(data.frame(pred = as.numeric(phat)), aes(x = pred)) +
geom_density() +
geom_vline(xintercept = ci95[1], linetype = "dashed") +
geom_vline(xintercept = ci95[2], linetype = "dashed") +
coord_cartesian(xlim = c(0, 1)) +
labs(x = "P(Diabetes = 1)", y = "Density", title = title_txt) +
theme_minimal()
}
# Use participants from analytic data (adult) for realistic inputs
stopifnot(nrow(adult) >= 2)
p1 <- to_model_row(adult$age[1], adult$bmi[1],
as.character(adult$sex[1]), as.character(adult$race4[1]))
p2 <- to_model_row(adult$age[2], adult$bmi[2],
as.character(adult$sex[2]), as.character(adult$race4[2]))
plot_post_density(p1, "Participant 1: Posterior Predictive Distribution (95% CrI)")par_names <- names(posterior::as_draws_df(bayes_fit))
want <- c("b_age_c","b_bmi_c","b_sexFemale","b_race4Black","b_race4Hispanic","b_race4Other")
have <- intersect(want, par_names)
if (length(have) == 0) {
knitr::asis_output("**No matching slope parameters found for mcmc_areas.**")
} else {
bayesplot::mcmc_areas(as.array(bayes_fit), pars = have, prob = 0.95)
}Posterior predictive checks showed that simulated outcome distributions closely matched the observed diabetes prevalence, indicating strong model calibration. Both the mean and standard deviation of replicated outcomes aligned with observed data, suggesting the model adequately captured central tendency and dispersion. These results provide graphical evidence of good fit and reinforce that the priors did not unduly constrain the posterior.
or_plot <- bayes_or %>%
dplyr::filter(term != "Intercept") %>%
dplyr::mutate(
term_clean = dplyr::case_when(
term == "age_c" ~ "Age (per 1 SD)",
term == "bmi_c" ~ "BMI (per 1 SD)",
term == "sexFemale" ~ "Female (vs. Male)",
term == "sexMale" ~ "Male (vs. Female)",
term == "race4Black" ~ "Black (vs. White)",
term == "race4Hispanic" ~ "Hispanic (vs. White)",
term == "race4Other" ~ "Other (vs. White)",
TRUE ~ term
)
)
ggplot(or_plot, aes(x = OR, y = reorder(term_clean, OR))) +
geom_vline(xintercept = 1, linetype = 2) +
geom_point() +
geom_errorbarh(aes(xmin = LCL, xmax = UCL), height = 0.15) +
labs(x = "Odds ratio (logit model, posterior)", y = NULL)Calibration between predicted and observed diabetes probabilities is displayed in Figure 9.
pred_mean <- colMeans(brms::posterior_epred(bayes_fit))
ggplot(data.frame(pred = pred_mean, obs = yobs),
aes(x = pred, y = obs)) +
geom_point(alpha = 0.15, position = position_jitter(height = 0.03)) +
geom_smooth(method = "loess", se = TRUE) +
labs(x = "Mean predicted probability", y = "Observed diabetes (0/1)")post_pred <- brms::posterior_epred(bayes_fit, summary = FALSE)
post_prev <- rowMeans(post_pred)
obs_prev <- mean(adult_imp1$diabetes_ind, na.rm = TRUE)
post_prev_mat <- matrix(post_prev, ncol = 1, dimnames = list(NULL, "Prevalence"))
p <- bayesplot::mcmc_dens(post_prev_mat)
p + ggplot2::labs(title = "Posterior diabetes prevalence",
x = "Predicted prevalence", y = NULL) +
ggplot2::geom_vline(xintercept = obs_prev, linetype = 2)
pThe posterior predictive distribution of diabetes prevalence closely mirrored the survey-estimated prevalence, with the posterior mean aligning within 1% of the observed rate. This indicates that the Bayesian model accurately reproduced the population-level prevalence and supports its calibration for epidemiologic inference.
# Posterior predictive prevalence (replicated datasets)
yrep <- brms::posterior_predict(bayes_fit, ndraws = 2000) # draws x observations (0/1)
post_prev <- rowMeans(yrep) # prevalence each posterior draw
# Survey-weighted observed prevalence (population estimate)
des_obs <- survey::svydesign(
id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR,
nest = TRUE, data = adult_imp1
)
obs <- survey::svymean(~diabetes_ind, des_obs, na.rm = TRUE)
obs_prev <- as.numeric(obs["diabetes_ind"])
obs_se <- as.numeric(SE(obs)["diabetes_ind"])
obs_lcl <- max(0, obs_prev - 1.96 * obs_se)
obs_ucl <- min(1, obs_prev + 1.96 * obs_se)
# Plot: posterior density with weighted point estimate and 95% CI band
ggplot(data.frame(prev = post_prev), aes(x = prev)) +
geom_density(alpha = 0.6) +
annotate("rect", xmin = obs_lcl, xmax = obs_ucl, ymin = 0, ymax = Inf, alpha = 0.15) +
geom_vline(xintercept = obs_prev, linetype = 2) +
coord_cartesian(xlim = c(0, 1)) +
labs(x = "Diabetes prevalence", y = "Posterior density",
subtitle = sprintf("Survey-weighted NHANES prevalence = %.1f%% (95%% CI %.1f–%.1f%%)",
100*obs_prev, 100*obs_lcl, 100*obs_ucl)) +
theme_minimal()[1] "b_age_c" "b_bmi_c" "b_Intercept" "b_race4Black"
[5] "b_race4Hispanic" "b_race4Other" "b_sexMale"
src_obj <- if (exists("bayes_fit_prior") && !is.null(bayes_fit_prior)) bayes_fit_prior else bayes_fit
draws_df <- posterior::as_draws_df(src_obj)
all_cols <- names(draws_df)
# EDIT THESE if your names differ (check the printed list from list-brms-pars)
want_post <- c("b_age_c","b_bmi_c","b_sexFemale","b_race4Black","b_race4Hispanic","b_race4Other")
have_post <- intersect(want_post, all_cols)
have_prior <- intersect(paste0("prior_", have_post), all_cols)
pairs <- data.frame(post = have_post, prior = paste0("prior_", have_post),
stringsAsFactors = FALSE)
pairs <- pairs[pairs$prior %in% have_prior, , drop = FALSE]
if (nrow(pairs) == 0) {
knitr::asis_output("**No matching prior/posterior parameters found to overlay.**")
} else {
post_long <- tidyr::pivot_longer(
draws_df[, pairs$post, drop = FALSE],
cols = tidyselect::everything(), names_to = "term", values_to = "estimate"
)
post_long$type <- "Posterior"
prior_tmp <- draws_df[, pairs$prior, drop = FALSE]
colnames(prior_tmp) <- pairs$post
prior_long <- tidyr::pivot_longer(
prior_tmp,
cols = tidyselect::everything(), names_to = "term", values_to = "estimate"
)
prior_long$type <- "Prior"
combined_draws <<- dplyr::bind_rows(prior_long, post_long)
lbl <- c(
b_age_c = "Age (1 SD)", b_bmi_c = "BMI (1 SD)",
b_sexFemale = "Female vs Male",
b_race4Black = "Black vs White",
b_race4Hispanic = "Hispanic vs White",
b_race4Other = "Other vs White"
)
combined_draws$term <- factor(
combined_draws$term,
levels = intersect(names(lbl), unique(combined_draws$term)),
labels = lbl[intersect(names(lbl), unique(combined_draws$term))]
)
ggplot2::ggplot(combined_draws, ggplot2::aes(x = estimate, linetype = type)) +
ggplot2::geom_density() +
ggplot2::facet_wrap(~ term, scales = "free", ncol = 2) +
ggplot2::labs(x = "Coefficient (log-odds)", y = "Density", linetype = NULL) +
ggplot2::theme_minimal()
}No matching prior/posterior parameters found to overlay.
if (exists("combined_draws") && is.data.frame(combined_draws) && nrow(combined_draws) > 0) {
ggplot(combined_draws, aes(x = estimate, fill = type)) +
geom_density(alpha = 0.4) +
facet_wrap(~ term, scales = "free", ncol = 2) +
theme_minimal(base_size = 13) +
labs(
title = "Prior vs Posterior Distributions",
x = "Coefficient estimate",
y = "Density",
fill = ""
)
} else {
knitr::asis_output("**Skipped: no matching prior/posterior draws to plot.**")
}Skipped: no matching prior/posterior draws to plot.
par_names <- names(posterior::as_draws_df(bayes_fit))
want <- c("b_age_c","b_bmi_c","b_sexFemale","b_race4Black","b_race4Hispanic","b_race4Other")
have <- intersect(want, par_names)
if (length(have)) {
bayesplot::mcmc_areas(as.array(bayes_fit), pars = have, prob = 0.95)
} else {
knitr::asis_output("**No matching slope parameters available to plot.**")
}For age and BMI, the posterior densities shift notably away from the N(0, 2.5) prior toward positive values and are narrower, indicating strong information from the data; for sex, the posterior remains closer to the prior with more overlap, indicating weaker evidence.
The overlay of prior and posterior densities illustrates that informative updates occurred primarily for BMI, age, and race coefficients, which showed distinct posterior shifts relative to the priors. In contrast, weaker predictors such as sex displayed overlapping distributions, indicating that inference for those parameters was more influenced by prior uncertainty than by the observed data. This balance confirms appropriate regularization rather than overfitting.
A concise summary of posterior estimates is provided below.
# Combine results from all three models
svy_tbl <- if (exists("svy_or") && nrow(svy_or) > 0)
dplyr::mutate(svy_or, Model = "Survey-weighted (MLE)") else NULL
mi_tbl <- if (exists("mi_or") && nrow(mi_or) > 0)
dplyr::mutate(mi_or, Model = "MICE Pooled") else NULL
bayes_tbl <- if (exists("bayes_or") && nrow(bayes_or) > 0)
dplyr::mutate(bayes_or, Model = "Bayesian") %>%
dplyr::filter(term != "Intercept") else NULL
all_tbl <- dplyr::bind_rows(Filter(Negate(is.null), list(svy_tbl, mi_tbl, bayes_tbl))) %>%
dplyr::mutate(
term = dplyr::case_when(
grepl("^bmi", term, ignore.case = TRUE) ~ "BMI (per 1 SD)",
grepl("^age", term, ignore.case = TRUE) ~ "Age (per 1 SD)",
grepl("^sexFemale$", term) ~ "Female (vs. Male)",
grepl("^sexMale$", term) ~ "Male (vs. Female)",
grepl("^race4Hispanic$", term) ~ "Hispanic (vs. White)",
grepl("^race4Black$", term) ~ "Black (vs. White)",
grepl("^race4Other$", term) ~ "Other (vs. White)",
TRUE ~ term
),
OR_CI = sprintf("%.2f (%.2f – %.2f)", OR, LCL, UCL)
) %>%
dplyr::select(Model, term, OR_CI)Comparative odds ratios across frameworks are shown in Table 7.
| Model | term | OR_CI |
|---|---|---|
| Survey-weighted (MLE) | Age (per 1 SD) | 2.98 (2.70 – 3.28) |
| Survey-weighted (MLE) | BMI (per 1 SD) | 1.93 (1.68 – 2.22) |
| Survey-weighted (MLE) | Male (vs. Female) | 1.29 (1.02 – 1.63) |
| Survey-weighted (MLE) | Hispanic (vs. White) | 1.81 (1.43 – 2.29) |
| Survey-weighted (MLE) | Black (vs. White) | 1.60 (1.16 – 2.21) |
| Survey-weighted (MLE) | Other (vs. White) | 2.20 (1.46 – 3.30) |
| MICE Pooled | scale(age) | 2.90 (2.61 – 3.21) |
| MICE Pooled | scale(bmi) | 1.81 (1.66 – 1.96) |
| MICE Pooled | relevel(sex, “Male”)Female | 0.81 (0.68 – 0.96) |
| MICE Pooled | relevel(race4, “White”)Hispanic | 2.07 (1.66 – 2.59) |
| MICE Pooled | relevel(race4, “White”)Black | 1.79 (1.43 – 2.24) |
| MICE Pooled | relevel(race4, “White”)Other | 2.00 (1.51 – 2.66) |
| Bayesian | Age (per 1 SD) | 2.93 (2.62 – 3.30) |
| Bayesian | BMI (per 1 SD) | 1.92 (1.76 – 2.10) |
| Bayesian | Male (vs. Female) | 1.28 (1.06 – 1.55) |
| Bayesian | Hispanic (vs. White) | 1.82 (1.40 – 2.38) |
| Bayesian | Black (vs. White) | 1.62 (1.23 – 2.12) |
| Bayesian | Other (vs. White) | 2.11 (1.46 – 2.99) |
This table summarizes results from the survey-weighted (design-based), multiple-imputation, and Bayesian models.
The Bayesian model’s credible intervals closely align with the frequentist confidence intervals but provide a more direct probabilistic interpretation of uncertainty.
Across all three frameworks—survey-weighted (MLE), multiple imputation, and Bayesian—age and BMI were consistently associated with higher odds of doctor-diagnosed diabetes. Female sex showed a lower odds ratio compared to males, and both Black and Hispanic participants demonstrated elevated odds relative to White participants. The similarity of effect sizes across frameworks underscores the robustness of these predictors to different modeling assumptions and missing-data treatments. Bayesian credible intervals largely overlapped frequentist confidence intervals, confirming stability of inference while allowing direct probabilistic interpretation.
The Bayesian logistic regression framework produced results that were highly consistent with both the survey-weighted and MICE-pooled frequentist models. Age and BMI remained the most influential predictors of doctor-diagnosed diabetes, each showing a strong and positive association with diabetes risk.
Unlike classical maximum likelihood estimation, the Bayesian approach directly quantified uncertainty through posterior distributions, offering richer interpretability and more transparent probability statements. The alignment between Bayesian and design-based estimates supports the robustness of these associations and highlights the practicality of Bayesian modeling for complex, weighted population data.
Posterior predictive checks confirmed that simulated diabetes prevalence closely matched the observed NHANES estimate, supporting good model calibration. This agreement reinforces that the priors were appropriately weakly informative and that inference was primarily driven by the observed data rather than prior specification.
Overall, this study demonstrates that Bayesian inference complements traditional epidemiologic methods by maintaining interpretability while enhancing stability and explicitly quantifying uncertainty in complex survey data. These consistent findings across modeling frameworks underscore the robustness of core risk factors and support the use of Bayesian inference for epidemiologic research involving complex survey data.
While this analysis demonstrates the value of Bayesian logistic regression for epidemiologic modeling, several limitations should be acknowledged.
First, the use of a single imputed dataset for the Bayesian model—rather than full joint modeling of imputation uncertainty—may understate total variance.
Second, NHANES exam weights were normalized and treated as importance weights, which approximate but do not fully reproduce design-based inference.
Third, the weakly informative priors \(N(0, 2.5)\) for slopes and Student-t(3, 0, 10) for the intercept were not empirically tuned; alternative prior specifications could slightly alter posterior intervals.
Finally, although convergence diagnostics (R̂ ≈ 1, sufficient effective sample sizes, and stable posterior predictive checks) indicated good model performance, results are conditional on the 2013–2014 NHANES cycle and may not generalize to later datasets or longitudinal analyses.
The Bayesian, survey-weighted, and imputed logistic regression frameworks all identified consistent predictors of diabetes risk in U.S. adults: advancing age, higher BMI, sex (lower odds for females), and non-White race/ethnicity.
The Bayesian model produced estimates nearly identical in direction and magnitude to the frequentist results while providing a more comprehensive assessment of uncertainty through posterior distributions and credible intervals.
These consistent findings across modeling frameworks underscore the robustness of core risk factors and support the use of Bayesian inference for epidemiologic research involving complex survey data.
By incorporating prior information and using MCMC to sample from the full posterior distribution, Bayesian inference enhances model transparency and interpretability.
Its agreement with traditional approaches underscores that Bayesian methods can be applied confidently in large-scale public health datasets such as NHANES.
Future extensions could integrate hierarchical priors, multiple NHANES cycles, or Bayesian model averaging to better capture population heterogeneity, temporal trends, and evolving diabetes risk patterns.